Abstract



In this paper, we propose a data representation model that demonstrates hierarchical
feature learning using nsNMF. We extend the unit algorithm into several layers.
Experiments with document and image data successfully discovered feature hierarchies.
We also show that the proposed method results in much better classification and
reconstruction performance, especially for a small number of features.

1 Introduction 

To understand complex data, hierarchical feature extraction strategies have been used [1].
One of the best-known algorithms is the Deep Belief Network (DBN), introduced in 2006 [2].
With the success of training deep architectures, several variants of deep learning have been
introduced [3]. Although these multi-layered algorithms take hierarchical approaches to
feature extraction and provide efficient solutions to complex problems, they do not reveal,
in the form of hierarchies, the relationships among the features learned throughout the
hierarchical structure.

In this paper, we propose a hierarchical data representation model, hierarchical multi-layer
non-negative matrix factorization. (A similar approach has been introduced in [4].) We extend
a variant of the NMF algorithm [5], nsNMF [6], into several layers for hierarchical learning.
Here, we demonstrate intuitive feature hierarchies present in the data set by learning
relationships between features across layers. We also show that, compared with one-step
learning, the hierarchical approach learns more meaningful and helpful features. This leads to
better distributed representations and to better classification and reconstruction performance
for a small number of features, so that less performance is lost even when data are
represented in few dimensions.

2 Non-smooth non-negative matrix factorization (nsNMF) 

The proposed network is constructed by stacking nsNMF [6] into several layers. Non-smooth
non-negative matrix factorization (nsNMF) is a variant of NMF that imposes a sparsity
constraint. Basic NMF decomposes non-negative input data $X$ into non-negative $W$ and $H$,
which are the features and the corresponding coefficients (the data representation),
respectively. It aims to reduce the error between the original data $X$ and its
reconstruction $WH$:

$$C = \frac{1}{2}\,\|X - WH\|^2 = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n}\Bigl(X_{ij} - \sum_{k} W_{ik}H_{kj}\Bigr)^2 .$$
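As a quick numerical check of the cost above, the following minimal sketch evaluates $C$ both as a squared Frobenius norm and as the explicit sum over entries. It uses NumPy; the matrix sizes, the random data, and the function names are assumptions made purely for illustration.

```python
import numpy as np

def nmf_cost(X, W, H):
    """Half the squared Frobenius norm of the residual X - WH."""
    return 0.5 * np.linalg.norm(X - W @ H, ord="fro") ** 2

def nmf_cost_elementwise(X, W, H):
    """The same cost written as the explicit sum over entries (i, j)."""
    total = 0.0
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            recon_ij = sum(W[i, k] * H[k, j] for k in range(W.shape[1]))
            total += (X[i, j] - recon_ij) ** 2
    return 0.5 * total

rng = np.random.default_rng(0)
X = rng.random((6, 5))   # non-negative data, one sample per column (illustrative sizes)
W = rng.random((6, 3))   # features
H = rng.random((3, 5))   # coefficients (data representation)

cost_norm = nmf_cost(X, W, H)
cost_sum = nmf_cost_elementwise(X, W, H)
```

Both functions compute the same quantity; the vectorized form is the one that would normally be used in practice.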



To apply a sparsity constraint to standard NMF, a sparsity matrix $S$ is introduced in [6]:
$S = (1 - \theta)\,I(k) + \frac{\theta}{k}\,\mathrm{ones}(k)$, where $k$ is the number of
features and $\theta$ is the parameter controlling the smoothing effect, in the range 0 to 1.
$I(k)$ is the identity matrix of size $k \times k$, and $\mathrm{ones}(k)$ is a matrix of size
$k \times k$ whose components are all 1. We smooth a matrix by multiplying it by $S$; the
closer $\theta$ is to 1, the stronger the smoothing effect. During the alternating updates, we
smooth the $H$ matrix at each iteration as $H = SH$. To compensate for the resulting loss of
sparsity, $W$ becomes sparse.
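The effect of the matrix $S$ can be seen on a small example. The sketch below builds $S$ for a few values of the smoothing parameter and applies it to a sparse coefficient matrix; the function name `smoothing_matrix` and the toy values of $H$ are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def smoothing_matrix(k, theta):
    """S = (1 - theta) * I(k) + (theta / k) * ones(k), with 0 <= theta <= 1."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return (1.0 - theta) * np.eye(k) + (theta / k) * np.ones((k, k))

# Sparse toy coefficients: k = 3 features (rows), 2 samples (columns).
H = np.array([[4.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])

S_none = smoothing_matrix(3, 0.0)   # identity: H is left unchanged
S_half = smoothing_matrix(3, 0.5)   # partial smoothing
S_full = smoothing_matrix(3, 1.0)   # each row of S_full @ H is the average over features

H_half = S_half @ H
H_full = S_full @ H
```

With $\theta = 0$ nothing changes, while with $\theta = 1$ every feature receives the same activation per sample. Intermediate values spread part of each activation over all features without changing the column sums, which is why $H$ loses sparsity after smoothing.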
Figure 1: Concept hierarchies in Reuters: (a) experimental results and (b) diagram of the results in (a).



3 Multi-layer architecture 

The proposed hierarchical multi-layer NMF structure comprises several layers of the unit
algorithm. We first train each layer separately. We process the outcome of each layer to
obtain $K^{(l)}$:

$$K^{(l)}_{kj} = f\!\left(\frac{H^{(l)}_{kj}}{M^{(l)}_{kj}}\right),$$

where $M^{(l)}$ is the matrix of column-wise means of $H^{(l)}$, $f(\cdot)$ is a nonlinear
function, and $l$ denotes the layer index, $l = 1, 2, \dots, L$. The superscript of each term
denotes the layer index. The processed data representation $K^{(l)}$ is used as the input to
the next layer: using nsNMF, $K^{(l)}$ is decomposed into $W^{(l+1)}$ and $H^{(l+1)}$, so that
$K^{(l)} \approx W^{(l+1)}H^{(l+1)}$. Then, we use the outcome of the separate training as
initialization and train the whole network jointly. The cost function for joint training is

$$C = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n}\Bigl(X_{ij} - \sum_{k} W^{(1)}_{ik}\hat{H}^{(1)}_{kj}\Bigr)^2 ,$$

where $\hat{H}^{(1)}$ is the reconstruction of $H^{(1)}$, which can be computed via back
propagation from the last layer to the $l$-th layer:
$\hat{H}^{(L-1)} \approx M^{(L-1)} \odot f^{-1}\bigl(W^{(L)}H^{(L)}\bigr)$, ...,
$\hat{H}^{(l)} \approx M^{(l)} \odot f^{-1}\bigl(W^{(l+1)}\hat{H}^{(l+1)}\bigr)$, where
$\hat{H}^{(L)} = H^{(L)}$, $\odot$ denotes the element-wise product, and $f^{-1}(\cdot)$ is the
inverse of the nonlinear function. (More details on the actual update computation are given in
Appendix A.) After training up to the last layer, the final data representation $H^{(L)}$ is
acquired. This is the activation information of the complex features, which integrate the
features throughout the layers, $W^{(1)}W^{(2)}\cdots W^{(L)}$.

For a more detailed explanation, refer to the pseudo-code for the training procedure in Appendix B.
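The per-layer processing and the backward reconstruction described in this section can be sketched as follows. Several choices here are assumptions made only for illustration: the nonlinearity is taken to be the square root (the text does not fix $f$), the mean matrix is read as the mean activation of each feature over all samples, and all function names are hypothetical.

```python
import numpy as np

def f(x):
    """Hypothetical nonlinearity (an assumption; f is not specified in the text)."""
    return np.sqrt(x)

def f_inv(y):
    """Inverse of the hypothetical f on non-negative inputs."""
    return y ** 2

def mean_matrix(H, eps=1e-12):
    """M with the shape of H: each entry holds the mean activation of its feature
    over all samples (one reading of 'column-wise mean', assumed here)."""
    return np.broadcast_to(H.mean(axis=1, keepdims=True) + eps, H.shape)

def process(H):
    """K = f(H / M): the processed representation passed to the next layer."""
    M = mean_matrix(H)
    return f(H / M), M

def reconstruct_first_layer(Ws, Ms, H_last):
    """H_hat^(1) via H_hat^(l) = M^(l) * f_inv(W^(l+1) @ H_hat^(l+1)), l = L-1, ..., 1.

    Ws = [W^(1), ..., W^(L)], Ms = [M^(1), ..., M^(L-1)], H_last = H^(L).
    """
    H_hat = H_last
    for idx in range(len(Ws) - 1, 0, -1):   # Ws[idx] is W^(idx+1), Ms[idx-1] is M^(idx)
        H_hat = Ms[idx - 1] * f_inv(Ws[idx] @ H_hat)
    return H_hat

rng = np.random.default_rng(0)
H1 = rng.random((4, 10)) + 0.1          # first-layer representation: 4 features, 10 samples
K1, M1 = process(H1)                    # processed input for layer 2
W1 = rng.random((8, 4))                 # first-layer features for 8-dimensional data
W2, H2 = np.eye(4), K1                  # a trivially exact second-layer factorization of K1
H1_hat = reconstruct_first_layer([W1, W2], [M1], H2)
complex_features = W1 @ W2              # integrated features W^(1) W^(2)
```

Because the second layer in this toy setup factorizes $K^{(1)}$ exactly, the backward pass recovers $H^{(1)}$; with a learned, lower-rank second layer the recovery would only be approximate.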

4 Document data feature hierarchies 

We applied our proposed network to a document database. We used the "Reuters-21578 collection,
distribution 1.0" (see footnote 1) as the database. We selected the top 10 categories from the
ModApte split, removed stop-words in pre-processing, and reduced the dimension to 1000. There
are 5786 document samples for training and 2587 for testing. We constructed a two-layered
network with the number of hidden neurons set to 160.

We observed how concepts form hierarchies in document data in Figure 1(a). The first, second,
and third $W^{(1)}$ features contain words related to 'oil production' (exploration, pipeline,
production, industry), 'oil contract' (contract, purchase, barrel, prices), and 'oil refinery
processing' (refinery, reserves, pipeline, petroleum), respectively. These sub-class topic
features are combined and develop into one broader topic, 'oil.' From this combination
relationship of features, we can see that the three seemingly independent features can be
re-categorized under the same broader topic. (The concept hierarchy learned in Reuters: the
sub-categories 'oil production', 'contract', and 'refinery processing' exist under the 'oil'
category.)

1 The Reuters-21578, Distribution 1.0 test collection is available from David D. Lewis'
professional home page, currently: http://www.research.att.com/~lewis

Figure 2: Reconstruction error (left) and classification rate (right) of (a) Reuters and (b) MNIST.
Figure 3: (a) Reconstruction by the shallow network (first row) and the proposed network (third
row), with the original input (second row). (b) Final data representation (visualized by PCA)
of the shallow network (middle) and the proposed network (right), in comparison to the raw
data (left).

Furthermore, we analyzed reconstruction and classification performance, as shown in
Figure 2(a). The proposed hierarchical feature extraction method results in much better
classification and reconstruction, especially for a small number of features, compared to
extracting features in one step. This supports the efficiency and effectiveness of our
proposed approach in learning features.

We also applied our network to handwritten digit images (MNIST; see footnote 2). The final
data representation $H^{(L)}$ displayed distinct activation patterns for samples of different
classes, as a result of successfully learning the feature hierarchy, which determines how
low-level features combine to form distinct class features. In Figure 2(b), the reconstruction
error and classification performance also show that our proposed method performs better with a
small number of dimensions. In Figure 3(a), we can observe sparser and clearer reconstruction
by our proposed network. The Fisher discriminant values of the final data representations of
the shallow network and our proposed network were 0.51 and 0.61, respectively. We can infer
that the proposed network learns more meaningful and helpful features, which results in a
better distributed (clustered) representation of the data. This can also be checked via the
2-D visualization of $H^{(L)}$ shown in Figure 3(b).
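To make the notion of a discriminant value concrete, the sketch below computes one common scalar form of the Fisher criterion, the ratio of between-class to within-class scatter. This particular definition, the function name, and the toy data are assumptions; the exact formula behind the values 0.51 and 0.61 is not stated in the text, so the numbers produced here are not comparable to them.

```python
import numpy as np

def fisher_ratio(H, labels):
    """trace(S_b) / trace(S_w) for representations H (features x samples).

    One common scalar form of the Fisher criterion, assumed here for illustration.
    """
    labels = np.asarray(labels)
    overall_mean = H.mean(axis=1, keepdims=True)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        Hc = H[:, labels == c]                         # samples of class c
        class_mean = Hc.mean(axis=1, keepdims=True)
        between += Hc.shape[1] * np.sum((class_mean - overall_mean) ** 2)
        within += np.sum((Hc - class_mean) ** 2)
    return between / within

labels = [0, 0, 1, 1]
clustered = np.array([[0.0, 0.1, 1.0, 1.1]])   # classes form tight, well-separated groups
mixed = np.array([[0.0, 0.9, 0.1, 1.0]])       # classes overlap heavily

ratio_clustered = fisher_ratio(clustered, labels)
ratio_mixed = fisher_ratio(mixed, labels)
```

A larger ratio indicates that samples of the same class lie closer together relative to the distance between classes, which is the sense in which a representation is better clustered.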



5 Conclusion 

In this paper, we proposed a hierarchical data representation model, hierarchical multi-layer
NMF, built by stacking nsNMF into several layers. We demonstrated a hierarchical approach to
learning features. Our research has two main findings. Hierarchical learning by stacking NMFs
(1) reveals intuitive feature hierarchies (subcategories) by learning feature relationships
throughout the layers, and (2) learns more meaningful features than one-step learning; as a
result, our proposed method gives much better classification and reconstruction performance
when only a small number of dimensions is available for data representation. We expect our
proposed method to be applicable to various types of data for discovering underlying feature
hierarchies while maintaining reconstruction and classification performance even with a small
number of features for data representation.



2 Available at: http://yann.lecun.com/exdb/mnist/ 
A Appendix: Actual update computation

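Continued from Section 3, the actual computation is done as described in (1).

$$W^{(l)}_{ik} \leftarrow W^{(l)}_{ik}\,\frac{\bigl(Nu^{(l)}H^{(l)T}\bigr)_{ik}}{\bigl(De^{(l)}H^{(l)T}\bigr)_{ik}}, \quad\text{and}\quad H^{(l)}_{kj} \leftarrow H^{(l)}_{kj}\,\frac{\bigl(W^{(l)T}Nu^{(l)}\bigr)_{kj}}{\bigl(W^{(l)T}De^{(l)}\bigr)_{kj}}, \quad\text{where} \tag{1a}$$

$$Nu^{(l)} = \begin{cases} X & \text{if } l = 1 \\ \bigl(W^{(l-1)T}Nu^{(l-1)}\bigr) \odot \bigl(M^{(l-1)} \odot f^{-1\prime}(W^{(l)}H^{(l)})\bigr) & \text{otherwise} \end{cases} \tag{1b}$$

$$De^{(l)} = \begin{cases} \hat{X} & \text{if } l = 1 \\ \bigl(W^{(l-1)T}De^{(l-1)}\bigr) \odot \bigl(M^{(l-1)} \odot f^{-1\prime}(W^{(l)}H^{(l)})\bigr) & \text{otherwise} \end{cases} \tag{1c}$$

Here, $\hat{X} = W^{(1)}\hat{H}^{(1)}$. $\hat{H}^{(l)}$ is the reconstruction of $H^{(l)}$,
which can be computed via back propagation of errors from the last layer to the $l$-th layer,
as shown in (2).

$$\hat{H}^{(l)} = \begin{cases} H^{(l)} & \text{if } l = L \\ M^{(l)} \odot f^{-1}\bigl(W^{(l+1)}\hat{H}^{(l+1)}\bigr) & \text{if } l = L-1, \dots, 1 \end{cases} \tag{2}$$

$M^{(l)}$ is a matrix of the column-wise mean of $H^{(l)}$, $\odot$ denotes the element-wise
product, and $f^{-1}(\cdot)$ is the inverse nonlinear function, with derivative $f^{-1\prime}(\cdot)$.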
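One pass of the joint updates can be sketched in NumPy as below. This follows one reading of (1) and (2): multiplicative updates whose numerator and denominator terms are propagated from layer to layer. The square-root nonlinearity (so that the inverse is a square and its derivative is linear), the small constant `EPS` added to denominators, and all function names are assumptions made for illustration.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero (an implementation assumption)

def f_inv(y):
    """Inverse of the hypothetical nonlinearity f(x) = sqrt(x)."""
    return y ** 2

def f_inv_prime(y):
    """Derivative of f_inv."""
    return 2.0 * y

def reconstruct_input(Ws, Hs, Ms):
    """X_hat = W^(1) @ H_hat^(1), with H_hat^(1) obtained by the recursion in (2)."""
    H_hat = Hs[-1]
    for idx in range(len(Ws) - 1, 0, -1):
        H_hat = Ms[idx - 1] * f_inv(Ws[idx] @ H_hat)
    return Ws[0] @ H_hat

def joint_sweep(X, Ws, Hs, Ms):
    """One pass over the layers; the lists Ws and Hs are updated and returned."""
    nu = de = None
    for l in range(len(Ws)):
        if l == 0:
            nu, de = X, reconstruct_input(Ws, Hs, Ms)
        else:
            gate = Ms[l - 1] * f_inv_prime(Ws[l] @ Hs[l])
            nu = (Ws[l - 1].T @ nu) * gate
            de = (Ws[l - 1].T @ de) * gate
        Ws[l] = Ws[l] * (nu @ Hs[l].T) / (de @ Hs[l].T + EPS)
        Hs[l] = Hs[l] * (Ws[l].T @ nu) / (Ws[l].T @ de + EPS)
    return Ws, Hs

rng = np.random.default_rng(0)
X = rng.random((8, 20))                                   # 8-dimensional data, 20 samples
Ws = [rng.random((8, 5)), rng.random((5, 3))]             # two layers: 5 then 3 features
Hs = [rng.random((5, 20)), rng.random((3, 20))]
Ms = [np.broadcast_to(Hs[0].mean(axis=1, keepdims=True), Hs[0].shape)]
for _ in range(5):
    Ws, Hs = joint_sweep(X, Ws, Hs, Ms)
```

With a single layer the sweep reduces to the familiar multiplicative update of $W$ for basic NMF, since the numerator term is $X$ and the denominator term is $WH$. All factors stay non-negative because every term in the updates is non-negative.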
B Appendix: Pseudo-code for training procedure of proposed network

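%% Separate training of layers in extending mode
for l = 1 : L do
    Randomly initialize $W^{(l)}$ and $H^{(l)}$
    if l = 1 then
        $K^{(0)} = X$
    end if
    for iteration = 1 : (until convergence) do
        $W^{(l)}_{ik} \leftarrow W^{(l)}_{ik} \, (K^{(l-1)} H^{(l)T})_{ik} \, / \, (W^{(l)} H^{(l)} H^{(l)T})_{ik}$
        $H^{(l)}_{kj} \leftarrow H^{(l)}_{kj} \, (W^{(l)T} K^{(l-1)})_{kj} \, / \, (W^{(l)T} W^{(l)} H^{(l)})_{kj}$
    end for
    $K^{(l)}_{kj} = f\bigl(H^{(l)}_{kj} / M^{(l)}_{kj}\bigr)$
end for

%% Joint training of the whole network
Use $W^{(l)}$ and $H^{(l)}$, and use $M^{(l)}$, acquired from above
for iteration = 1 : (until convergence) do
    for l = 1 : L do
        if l = L then
            $\hat{H}^{(l)} = H^{(l)}$
        else
            $\hat{H}^{(l)} = M^{(l)} \odot f^{-1}(W^{(l+1)} \hat{H}^{(l+1)})$
            which can be written in full length as:
            $\hat{H}^{(l)} = M^{(l)} \odot f^{-1}\bigl(W^{(l+1)} (M^{(l+1)} \odot f^{-1}(W^{(l+2)} \cdots (M^{(L-1)} \odot f^{-1}(W^{(L)} H^{(L)})) \cdots ))\bigr)$
        end if

        if l = 1 then
            $Nu^{(l)} = X$
            $De^{(l)} = \hat{X}$
            which can be written in full length as:
            $\hat{X} = W^{(1)} \hat{H}^{(1)} = W^{(1)} \bigl(M^{(1)} \odot f^{-1}(W^{(2)} (M^{(2)} \odot f^{-1}(W^{(3)} \cdots (M^{(L-1)} \odot f^{-1}(W^{(L)} H^{(L)})) \cdots )))\bigr)$
        else
            $Nu^{(l)} = (W^{(l-1)T} Nu^{(l-1)}) \odot (M^{(l-1)} \odot f^{-1\prime}(W^{(l)} H^{(l)}))$
            $De^{(l)} = (W^{(l-1)T} De^{(l-1)}) \odot (M^{(l-1)} \odot f^{-1\prime}(W^{(l)} H^{(l)}))$
        end if

        $W^{(l)}_{ik} \leftarrow W^{(l)}_{ik} \, (Nu^{(l)} H^{(l)T})_{ik} \, / \, (De^{(l)} H^{(l)T})_{ik}$
        $H^{(l)}_{kj} \leftarrow H^{(l)}_{kj} \, (W^{(l)T} Nu^{(l)})_{kj} \, / \, (W^{(l)T} De^{(l)})_{kj}$
    end for
end for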
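The separate (layer-by-layer) training stage of the pseudo-code can be turned into a short runnable sketch. It uses the plain multiplicative updates shown in the pseudo-code, a fixed number of iterations in place of a convergence test, a hypothetical square-root nonlinearity, and a mean matrix read as the per-feature mean over samples; these choices and the function names are assumptions made for illustration.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero (an implementation assumption)

def f(x):
    """Hypothetical nonlinearity; an assumption, not specified by the pseudo-code."""
    return np.sqrt(x)

def train_layer(K_prev, k, n_iter=200, seed=0):
    """Factorize K_prev ~ W @ H with multiplicative updates; also returns the cost history."""
    rng = np.random.default_rng(seed)
    W = rng.random((K_prev.shape[0], k)) + 0.1
    H = rng.random((k, K_prev.shape[1])) + 0.1
    costs = []
    for _ in range(n_iter):
        W *= (K_prev @ H.T) / (W @ H @ H.T + EPS)
        H *= (W.T @ K_prev) / (W.T @ W @ H + EPS)
        costs.append(0.5 * np.sum((K_prev - W @ H) ** 2))
    return W, H, costs

def pretrain(X, layer_sizes):
    """Greedy layer-wise training: layer l factorizes the processed output of layer l-1."""
    K, Ws, Hs, Ms, histories = X, [], [], [], []
    for k in layer_sizes:
        W, H, costs = train_layer(K, k)
        M = np.broadcast_to(H.mean(axis=1, keepdims=True) + EPS, H.shape)
        K = f(H / M)          # processed representation, input to the next layer
        Ws.append(W)
        Hs.append(H)
        Ms.append(M)
        histories.append(costs)
    return Ws, Hs, Ms, histories

rng = np.random.default_rng(0)
X = rng.random((12, 30))                      # 12-dimensional data, 30 samples
Ws, Hs, Ms, histories = pretrain(X, [6, 3])   # two layers with 6 and 3 features
```

The returned factors would then serve as the initialization for the joint training stage. The cost history of each layer gives a simple way to check that the separate training has reduced the reconstruction error before joint training begins.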